The topic diversity of open-domain videos leads to various vocabularies and linguistic expressions for describing video contents, and therefore makes the video captioning task even more challenging. In this paper, we propose a unified caption framework, M&M TGM, which mines multimodal topics from data in an unsupervised fashion and guides the caption decoder with these topics. Compared to pre-defined topics, the mined multimodal topics are more semantically and visually coherent and better reflect the topic distribution of videos. We formulate topic-aware caption generation as a multi-task learning problem, in which we add a parallel task, topic prediction, in addition to the caption task. For the topic prediction task, we use the mined topics as the teacher to train a student topic prediction model, which learns to predict the latent topics from the multimodal contents of videos. The topic prediction provides intermediate supervision to the learning process. As for the caption task, we propose a novel topic-aware decoder to generate more accurate and detailed video descriptions with guidance from the latent topics. The entire learning procedure is end-to-end and optimizes both tasks simultaneously. The results of extensive experiments conducted on the MSR-VTT and Youtube2Text datasets demonstrate the effectiveness of our proposed model. M&M TGM not only outperforms prior state-of-the-art methods on multiple evaluation metrics and on both benchmark datasets, but also achieves better generalization ability.